Abstract.

We describe the APElink+ board, a PCIe interconnect adapter featuring the
latest advances in wire speed and interface technology, plus hardware support for an RDMA
programming model and experimental acceleration of GPU networking. This design allows us to
build a low-latency, high-bandwidth PC cluster network, APEnet+, the new generation of
our cost-effective cluster network architecture, scalable to tens of thousands of nodes.

We provide test results and a characterization of data transmission for a complete test bench
based on a commercial development card mounting an Altera® FPGA.



1. Introduction 

The Array Processor Experiment (APE) is a custom design for HPC targeting the field of Lattice
QCD. Started by the Istituto Nazionale di Fisica Nucleare in 1984 and joined by a number of physics
institutions all over the world, it has developed four generations of
custom machines [1, 2, 3, 4]. Leveraging the acquired know-how in networking and
re-employing the insights gained, a spin-off project called APEnet [5, 6] developed an interconnect
board that allows assembling a PC cluster à la APE with off-the-shelf components.

Following further developments funded by EU projects (FP6 SHAPES [7] [8] and FP7
EURETILE), the APEnet project evolved into APEnet+ [9]; its achievement is the design of the
APElink+ host adapter, which integrates both a network interface and a switching component,
bringing in state-of-the-art wire speeds for the links and a PCIe x8 Gen2 host interface. With
this latest push towards higher bandwidth, low power and low cost in the data transmission system,
we are addressing not only a broader range of intensive numerical algorithms (Lattice QCD
is our primary but not exclusive concern), but also the field of acquisition systems for modern
particle and astroparticle experiments (sLHC, ILC, CLIC, NA62, ...).

The outline of this article is as follows: Section 2 explains the global network
architecture; Section 3 gives the details of the host board; Section 4 outlines the
software stack provided by the programming environment; Section 5 sketches the current
deployment of APEnet+ hardware in the framework of our QUOnG HPC initiative; Section 6
gives conclusions and outlines future work.



2. The APEnet+ hardware 

The APEnet+ interconnect is our low-latency, high-bandwidth, packet-based direct network,
supporting state-of-the-art link wire speeds and a PCIe x8 Gen2 host connection. On this
network, the computing host (e.g. a multi-core CPU optionally paired with a GPU) is
equipped with one APElink+ board and made into a node of the cluster. The nodes are
connected by point-to-point links to form a 3D torus on a cubic mesh; each node communicates
with each of its 6 neighbours along the X+, X-, Y+, Y-, Z+ and Z- directions over
bi-directional, full-duplex communication channels.
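
As an illustration of the topology just described, the following minimal Python sketch computes the six neighbours of a node on a 3D torus with wrap-around links. The coordinate convention, the function name `torus_neighbours` and the lattice size used in the example are assumptions made for illustration; they are not taken from the APEnet+ design.

```python
def torus_neighbours(node, dims):
    """Return the 6 neighbours of `node` on a 3D torus of size `dims`.

    Keys follow the X+, X-, Y+, Y-, Z+, Z- naming used in the text.
    Coordinates wrap around at the lattice boundaries.
    """
    x, y, z = node
    nx, ny, nz = dims
    return {
        "X+": ((x + 1) % nx, y, z),
        "X-": ((x - 1) % nx, y, z),
        "Y+": (x, (y + 1) % ny, z),
        "Y-": (x, (y - 1) % ny, z),
        "Z+": (x, y, (z + 1) % nz),
        "Z-": (x, y, (z - 1) % nz),
    }
```

For instance, on a hypothetical 4x4x4 torus the X- neighbour of node (0, 0, 0) is (3, 0, 0), reached through the wrap-around link.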

The size of the packet envelope (header plus footer) is hard-coded and fixed, while the payload size is
variable; packets are auto-routed to their final destinations according to wormhole,
dimension-ordered static routing, with the system taking care of deadlock avoidance.

The hardware block structure, depicted in Figure 1, is split into a so-called network interface
(the packet injection and processing logic comprising PCIe, TX/RX logic, etc.), a router
component and multiple torus links.

Figure 1. Internal FPGA block architecture.



The APElink+ network interface has two main tasks:

• On the transmit data path, it gathers data coming in from the PCI Express port, fragmenting
the data stream into packets which are forwarded to the relevant destination ports,
depending on the requested operation.

• On the receive side, it provides hardware support for the RDMA programming model, 
implementing the basic RDMA capabilities (PUT and GET semantics) at the firmware 
level. 

Within this block, the addition of a Nios II 32-bit embedded micro-controller, a standard
Altera® Intellectual Property core, simplifies some tasks along the path of the received packets.

The routing block takes care of examining the packet header and resolving the destination 
address into a proper path across the switch according to the chosen routing algorithm. 

The torus link block manages the data flow by encapsulating the APEnet+ packets into a
light, low-level, word-stuffing protocol able to detect transmission errors via CRC. It implements
two virtual channels [10] and proper flow-control logic on each RX link block to guarantee
deadlock-free operation.
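
The flow control on the torus links is credit-based (see Section 3.1). The sketch below is a toy Python model of that idea for a single virtual channel: the transmitter holds one credit per free slot in the remote receive buffer and stalls when it runs out. The class name `CreditLink`, the buffer depth and the word-level granularity are assumptions for illustration, not the actual firmware logic.

```python
from collections import deque


class CreditLink:
    """Toy model of one virtual channel between a TX block and a remote RX buffer."""

    def __init__(self, rx_buffer_slots):
        self.credits = rx_buffer_slots  # TX-side count of free RX slots
        self.rx_buffer = deque()

    def send(self, word):
        """Transmit one word if a credit is available; return True on success."""
        if self.credits == 0:
            return False  # TX must stall, otherwise the RX buffer would overflow
        self.credits -= 1
        self.rx_buffer.append(word)
        return True

    def consume(self):
        """RX drains one word and hands a credit back to the TX side."""
        word = self.rx_buffer.popleft()
        self.credits += 1
        return word
```

Because the transmitter never sends without a credit, the receive buffer cannot overflow, which is the property the handshake is meant to guarantee.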



3. The APElink+ card 

For the design of the building block of the APEnet+ infrastructure we leveraged the most
recent advances in host interface technology, physical link speed and connector mechanics; the
result is the latest generation of our hardware, the APElink+ card (see Table 1).

The APElink+ card is a single FPGA-based PCI Express board; the employed FPGA device
is the EP4SGX290, which is part of the Altera® 40 nm Stratix IV device family and comes
equipped with 36 full-duplex CDR-based transceivers, supporting data rates up to 8.5 Gbps
each. It also provides a PCIe x8 Gen2 interface, which is complemented by a commercial PCIe
core to allow communication between the host processor and the network.

Moreover, an Ethernet port is foreseen in order to build an additional, secondary network 
with an offload engine for collective communication tasks. 

3.1. APElink+ physical links

In the global network structure, each card stands as a vertex of a 3D torus mesh network with
6 independent point-to-point multi-lane link channels (i.e. the links between mesh sites). Each
link is made up of 4 bi-directional lanes bonded together; the automatic alignment logic is our
original addition.
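
The raw link speed and peak host bandwidth quoted in Table 1 can be cross-checked with a few lines of arithmetic. The sketch below takes the lane count and the 8.5 Gbps transceiver limit from the text; the PCIe Gen2 figures (5 GT/s per lane, 8b/10b line coding) come from the PCI Express standard rather than from this article, and the function names are ours.

```python
LANES_PER_LINK = 4            # from the text: 4 bonded lanes per torus link
LANE_RATE_GBPS = 8.5          # from the text: Stratix IV transceiver maximum

PCIE_LANES = 8                # PCIe x8
PCIE_GEN2_GTPS = 5.0          # PCIe Gen2 signalling rate per lane
PCIE_GEN2_ENCODING = 8 / 10   # 8b/10b line coding used by PCIe Gen1/Gen2


def raw_link_gbps(lanes=LANES_PER_LINK, lane_rate=LANE_RATE_GBPS):
    """Raw (pre-encoding) bit rate of one torus link, per direction."""
    return lanes * lane_rate


def pcie_gen2_gbytes_per_s(lanes=PCIE_LANES):
    """Peak PCIe Gen2 bandwidth per direction in GB/s, before protocol overhead."""
    return lanes * PCIE_GEN2_GTPS * PCIE_GEN2_ENCODING / 8
```

With these inputs the raw link speed comes out at 34 Gbps and the host interface at 4 GB/s in each direction, matching the table entries.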

Four links out of six are hosted on the main board; the other two, Z+ and Z-, are located
on a small, detachable daughter-card on the upper level. The complete card thus takes
up two standard PCI slots in a PC chassis, while retaining the option, if four links are enough, of
using it in a single-slot-wide configuration.

The torus links are 6 independent blocks, each with 2 virtual-channel receive buffers added
for deadlock prevention. Flow control is maintained via a credit handshake between
a local RX block and the remote TX block; this handshake is embedded in the link protocol
data layer. The torus link is able to autonomously re-transmit the header and the footer in
case of transmission errors. The protocol therefore assures the delivery of the packet, avoiding
non-recoverable situations where badly corrupted packets (with errors in the header or footer)
pose a threat to the global routing. Packets with payload errors (signaled by the footer) must
instead be handled at the software level. The chosen CRC generator polynomial is the
industry-standard CRC-32.
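
To show how a payload error can be signaled through the footer, here is a minimal Python sketch using the CRC-32 from the standard `zlib` module (the IEEE 802.3 polynomial). The footer layout, a 4-byte big-endian CRC computed over the payload only, is an assumption of this sketch; the text does not specify which fields the CRC covers, nor its bit ordering or initial value.

```python
import struct
import zlib


def add_footer(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 of the payload (hypothetical footer layout)."""
    return payload + struct.pack(">I", zlib.crc32(payload))


def payload_ok(packet: bytes) -> bool:
    """Recompute the CRC-32 over the payload and compare it with the footer."""
    payload, footer = packet[:-4], packet[-4:]
    return struct.unpack(">I", footer)[0] == zlib.crc32(payload)
```

A receiver would call `payload_ok` on each incoming packet and, on failure, report the error to software, in line with the division of labour described above.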

3.2. APElink+ routing capabilities

The router comprises a fully connected, 7-ports-in/7-ports-out switch, plus routing and 
arbitration blocks. The routing block examines the header of each packet and translates its 
destination address to a proper path across the switch; the routing is dimension-ordered, with 
a measured latency of 60 ns. 
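
Dimension-ordered routing can be sketched in a few lines of Python. In the sketch below a packet first resolves its X offset, then Y, then Z. Taking the shorter way around each ring, breaking ties towards the "+" direction and the `"LOCAL"` port label are assumptions of this sketch; the text only states that routing is static and dimension-ordered.

```python
def next_port(current, dest, dims):
    """Pick the output port for a packet at `current` heading to `dest`.

    Dimensions are resolved strictly in X, Y, Z order. Within a ring the
    shorter way round is taken; ties go to "+" (both are assumptions).
    """
    for axis, name in enumerate("XYZ"):
        size = dims[axis]
        forward = (dest[axis] - current[axis]) % size  # hops in the "+" direction
        if forward == 0:
            continue  # this dimension is already resolved
        return name + ("+" if forward <= size - forward else "-")
    return "LOCAL"  # packet has arrived: deliver to the host side


def route(src, dest, dims):
    """Return the list of output ports traversed from `src` to `dest`."""
    step = {"+": 1, "-": -1}
    node, path = list(src), []
    while (port := next_port(node, dest, dims)) != "LOCAL":
        axis = "XYZ".index(port[0])
        node[axis] = (node[axis] + step[port[1]]) % dims[axis]
        path.append(port)
    return path
```

On a hypothetical 4x4x4 torus, `route((0, 0, 0), (1, 3, 0), (4, 4, 4))` yields one hop on X+ followed by one hop on Y-, the latter through the wrap-around link.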



Table 1. Evolution of the APElink cards.

                   APElink                     APElink+
FPGA component     Altera Stratix S30          Altera Stratix IV GX 290
# links            6                           4/6
link technology    external National ser/des   embedded Altera transceivers
link cables        LVDS                        QSFP+ standard
raw link speed     6 Gbps                      34 Gbps
host interface     PCI-X 133 MHz               PCIe x8 Gen2
peak host BW       1 GB/s                      4+4 GB/s



3.3. Test bed and preliminary results 

A schematic view of the complete APEnet+ board is shown in Figure 2. The prototypes will
be available in February 2011.

Figure 2. APElink+ board.

Figure 3. Assembled APElink+ test system.



A test system has been built in order to develop the FPGA firmware, the PCIe interface and
the physical layer interconnection technology. We used a commercial Altera® development
kit (equipped with a smaller Altera® Stratix IV GX 230) and a custom-designed daughter-card
(an HSMC mezzanine designed at LABE in INFN-Roma) hosting 3 QSFP+ connectors and
some SMA test points. This setup allows us to test the complete communication chain up to a
bitrate of 24 Gbps for each link. Signal integrity was checked by connecting to dedicated SMA test
points, directly at the output of the FPGA transceivers (see Figure 4) and on the mezzanine
card (see Figure 5) after one Samtec® connector, two QSFP+ connectors and 1 m of copper
QSFP+ cable.

The link was successfully tested up to a 3 Gbps data rate (compared to the 8.5 Gbps achievable
with the Stratix IV embedded transceivers). Above this limit we found some signal degradation,
probably caused by the tower connector between the Altera development kit and the test
mezzanine. Investigation is in progress; a likely culprit is the reduced bandwidth (below 5
GHz) of the 19 mm QTH Samtec® connector, which will in any case be replaced by higher
bandwidth connectors in the production release of the communication card [12].

Characterization of signal integrity (and maximum achievable bandwidth) versus serial
transceiver pre-emphasis and equalization is still in progress.




Figure 4. Eye diagram at 5 Gbps on the development kit.

Figure 5. Eye diagram at 3 Gbps on the mezzanine card.



Recovered clock stability was checked by transmitting a pseudorandom data stream organized
in 128-bit-wide words over a 1 m copper QSFP+ cable and checking the relative phase between
the input and the output clocks (see Figure 6). The recovered clock was found to be stable and in phase
with the input clock up to 400 MHz.

Latency was checked by transmitting a pseudorandom sequence over a 1 m QSFP+ copper cable
and raising a flag every time a fixed test word is transmitted and received by the serializer and
the deserializer respectively (see Figure 7). The transmission system latency was found to be stable up to
a 160 MHz transmitting clock.

Figure 6. Recovered clock stability check at 40 MHz.

Figure 7. Latency measurement at 49 MHz.



4. The APEnet+ software stack 

All APEnet+ software is available under the GNU GPL Licence and comprises four major
components: the firmware software running on the FPGA embedded processor, the Linux kernel driver,
the application-level RDMA library and an MPI implementation, the latter three developed
and tested under RedHat Enterprise Linux 5.

Communication primitives (rdma_put(), rdma_get(), rdma_send()), buffer registration
primitives (register_buffer(), unregister_buffer()) and synchronization primitives
(wait_event()) covering a custom subset of the low-level RDMA APIs are made available to
the application programmer as a highly optimized C language library. On top of these, we built
a native APEnet+ BTL module for OpenMPI 1.X.
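
To make the PUT semantics concrete, here is a toy, single-process Python model of the roles played by the primitives listed above: buffer registration, a one-sided write into a remote registered buffer, and an event that the target can wait for. This is not the APEnet+ C library. The class `ToyNode`, its method signatures and the event tuple format are hypothetical and chosen only for illustration.

```python
from collections import deque


class ToyNode:
    """Hypothetical in-memory stand-in for a node with registered buffers."""

    def __init__(self):
        self.buffers = {}      # buffer id -> bytearray registered for RDMA
        self.events = deque()  # completion events, oldest first

    def register_buffer(self, buf_id, size):
        self.buffers[buf_id] = bytearray(size)

    def unregister_buffer(self, buf_id):
        del self.buffers[buf_id]

    def put(self, target, buf_id, offset, data):
        """Write `data` into a buffer that `target` registered earlier."""
        if buf_id not in target.buffers:
            raise KeyError("target buffer is not registered")
        dest = target.buffers[buf_id]
        if offset < 0 or offset + len(data) > len(dest):
            raise ValueError("write falls outside the registered buffer")
        dest[offset:offset + len(data)] = data
        target.events.append(("PUT", buf_id, offset, len(data)))

    def wait_event(self):
        """Return the oldest pending event (a real call would block instead)."""
        return self.events.popleft()
```

The point of the model is that the target takes no active part in the transfer: it registers a buffer once and later learns about the completed write through an event.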

Work is underway [13] on the hardware and software features needed for GPU-initiated
communications, e.g. providing, by means of so-called PCIe peer-to-peer transactions, a
CUDA-enabled [14] version of the rdma_put() primitive, in order to avoid intermediate copies onto
CPU memory buffers. To further reduce overhead, another development concerns the delivery
of RDMA events by the APElink+ hardware in CPU memory in a way that is accessible from
within CUDA kernels.

Another research topic is exposing GPU memory areas as RDMA buffers, so that they
can be the target of RDMA PUT and GET operations, further reducing the latency of network
operations. To this end, discussions are ongoing with some GPU vendors.

The firmware software running on the FPGA embedded processor is currently in charge of 
managing the RDMA virtual-to-physical address translation table, but we are exploring new 
ways to exploit it for higher-level tasks. 
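
A virtual-to-physical translation table of the kind managed by the embedded processor can be pictured with the toy Python model below. The 4 KiB page size, the dictionary-based storage and the class name `TranslationTable` are assumptions for illustration; the text does not describe how the firmware actually organizes the table.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, as on common x86 Linux hosts


class TranslationTable:
    """Toy page-granular virtual-to-physical table for registered buffers."""

    def __init__(self):
        self.pages = {}  # virtual page number -> physical page number

    def map_page(self, virt_addr, phys_addr):
        """Record the mapping of one page; both addresses must be page-aligned."""
        if virt_addr % PAGE_SIZE or phys_addr % PAGE_SIZE:
            raise ValueError("addresses must be page-aligned")
        self.pages[virt_addr // PAGE_SIZE] = phys_addr // PAGE_SIZE

    def translate(self, virt_addr):
        """Return the physical address, or raise KeyError if the page is unmapped."""
        vpn, offset = divmod(virt_addr, PAGE_SIZE)
        return self.pages[vpn] * PAGE_SIZE + offset
```

In this picture, registering a buffer would populate one entry per page, and each incoming RDMA packet would trigger a lookup for its destination virtual address.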

5. The deployment initiative 

We are currently exploring the interconnection of GPU-equipped systems by means of APEnet+
(the QUOnG project) to reach the PetaFLOPs range in aggregate computing power, and we are working
on some GPU-related driver optimizations. For 2011, our road-map foresees the integration
of a QUOnG rack: a mesh of rack-mounted 1U computing nodes, based on a
commodity Intel Xeon 5650 CPU and accelerated via high-end GPUs (Nvidia Tesla C1060/M2050),
interconnected with the APEnet+ hardware. This system, housed in a single 42U rack,
will show a peak performance exceeding 60 TeraFLOPs and a power consumption of less than
26 kW. Leveraging the APEnet+ network, multiple QUOnG racks can be assembled to push
the complete system up to the PetaFLOPs scale.
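
A back-of-the-envelope estimate shows what the PetaFLOPs scale means in racks. The sketch below uses only the per-rack bounds quoted above (more than 60 TeraFLOPs, less than 26 kW); the function names are ours, and the result ignores any interconnect or efficiency losses.

```python
import math

RACK_PEAK_TFLOPS = 60   # from the text: lower bound on peak performance per rack
RACK_POWER_KW = 26      # from the text: upper bound on power per rack


def racks_needed(target_tflops, per_rack=RACK_PEAK_TFLOPS):
    """Smallest number of racks whose combined peak reaches the target."""
    return math.ceil(target_tflops / per_rack)


def power_bound_kw(racks, per_rack_kw=RACK_POWER_KW):
    """Upper bound on the total power drawn by `racks` racks."""
    return racks * per_rack_kw
```

With these figures, at most 17 racks are needed to reach 1 PetaFLOPs (1000 TeraFLOPs) of peak performance, for a total power below 442 kW.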




6. Conclusions and future developments 

A first mini-cluster is being assembled with GPUs and the 3-link version of the APElink+,
for final validation of the firmware, the interconnection and the complete software stack
on a small network (2-8 nodes). Synthetic tests, as well as real-life simulations, will be
performed, so as to be ready for the 6-link prototype release and eventually a bigger cluster
deployment.

The presence on the APEnet+ card of a programmable component of considerable power will 
allow us to explore reconfigurable computing, e.g. accelerating some tasks directly in hardware. 

The needs of a large-scale deployment make it necessary for APEnet+ to employ
fault-tolerance features; we will be adding support for link self-diagnosis and the capability of routing
around faulty nodes [15, 16].